Abstract

We present extensions to a continuous-state dependency parsing method
that make it applicable to morphologically rich languages. Starting
with a high-performance transition-based parser that uses long
short-term memory (LSTM) recurrent neural networks to learn
representations of the parser state, we replace lookup-based word
representations with representations constructed from the orthographic
representations of the words, also using LSTMs. This allows statistical
sharing across word forms that are similar on the surface. Experiments
for morphologically rich languages show that the parsing model benefits
from incorporating the character-based encodings of words.

1 Introduction 

At the heart of natural language parsing is the challenge of
representing the “state” of an algorithm (what parts of a parse have
been built and what parts of the input string are not yet accounted
for) as it incrementally constructs a parse. Traditional approaches
rely on independence assumptions, decomposition of scoring functions,
and/or greedy approximations to keep this space manageable.
Continuous-state parsers have been proposed, in which the state is
embedded as a vector (Titov and Henderson, 2007; Stenetorp, 2013; Chen
and Manning, 2014; Dyer et al., 2015; Zhou et al., 2015; Weiss et al.,
2015). Dyer et al. reported state-of-the-art performance on English and
Chinese benchmarks using a transition-based parser whose
continuous-state embeddings were constructed using LSTM recurrent
neural networks (RNNs) whose parameters were estimated to maximize the
probability of a gold-standard sequence of parse actions.


The primary contribution made in this work is to take the idea of
continuous-state parsing a step further by making the word embeddings
that are used to construct the parse state sensitive to the morphology
of the words.¹ Since it is well known that a word’s form often provides
strong evidence regarding its grammatical role in morphologically rich
languages (Ballesteros, 2013, inter alia), this has promise to improve
accuracy and statistical efficiency relative to traditional approaches
that treat each word type as opaque and independently modeled. In the
traditional parameterization, words with similar grammatical roles will
only be embedded near each other if they are observed in similar
contexts with sufficient frequency. Our approach reparameterizes word
embeddings using the same RNN machinery used in the parser: a word’s
vector is calculated based on the sequence of orthographic symbols
representing it (§3).

Although our model is provided no supervision in the form of explicit
morphological annotation, we find that it gives a large performance
increase when parsing morphologically rich languages in the SPMRL
datasets (Seddah et al., 2013; Seddah and Tsarfaty, 2014), especially
in agglutinative languages and the ones that present extensive case
systems (§4). In languages that show little morphology, performance
remains good, showing that the RNN composition strategy is capable of
capturing both morphological regularities and arbitrariness in the
sense of Saussure (1916). Finally, a particularly noteworthy result is
that character-based word embeddings in some cases obviate explicit POS
information, which is usually found to be indispensable for accurate
parsing.

A secondary contribution of this work is to show that the
continuous-state parser of Dyer et al. (2015) can learn to generate
nonprojective trees. We do this by augmenting its transition operations
with a SWAP operation (Nivre, 2009) (§2.4), enabling the parser to
produce nonprojective dependencies, which are often found in
morphologically rich languages.

¹ Software for replicating the experiments is available from
https://github.com/clab/lstm-parser.

2 An LSTM Dependency Parser 

We begin by reviewing the parsing approach of 
Dyer et al. (2015) on which our work is based. 

Like most transition-based parsers, Dyer et al.’s parser can be
understood as the sequential manipulation of three data structures: a
buffer B initialized with the sequence of words to be parsed, a stack S
containing partially-built parses, and a list A of actions previously
taken by the parser. In particular, the parser implements the
arc-standard parsing algorithm (Nivre, 2004).

At each time step t, a transition action is applied that alters these
data structures by pushing or popping words from the stack and the
buffer; the operations are listed in Figure 1.
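The discrete side of these transitions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `ParserState`, `allowed`, and `apply_action` are hypothetical, tokens are represented by their sentence positions, REDUCE-LEFT is assumed to make the top of the stack the head of the item below it (REDUCE-RIGHT the reverse), and the SWAP precondition follows the usual formulation of Nivre (2009), in which only items still in their original order may be swapped.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ParserState:
    """Discrete configuration: buffer B, stack S, action history A, arcs built so far."""
    buffer: List[int]                                   # front of the buffer is index 0
    stack: List[int] = field(default_factory=list)      # top of the stack is the last item
    actions: List[str] = field(default_factory=list)
    arcs: List[Tuple[int, str, int]] = field(default_factory=list)  # (head, relation, dependent)


def allowed(state: ParserState) -> List[str]:
    """Transition types whose preconditions hold in this configuration."""
    ok = []
    if state.buffer:
        ok.append("SHIFT")
    if len(state.stack) >= 2:
        ok += ["REDUCE-LEFT", "REDUCE-RIGHT"]
        # Assumed precondition: the two topmost items are still in sentence order.
        if state.stack[-2] < state.stack[-1]:
            ok.append("SWAP")
    return ok


def apply_action(state: ParserState, action: str, relation: str = "") -> None:
    """Apply one transition in place."""
    if action not in allowed(state):
        raise ValueError(f"{action} is not allowed in this configuration")
    if action == "SHIFT":
        state.stack.append(state.buffer.pop(0))
    elif action == "SWAP":
        # Move the second-from-top stack item back to the front of the buffer.
        state.buffer.insert(0, state.stack.pop(-2))
    else:
        top = state.stack.pop()
        second = state.stack.pop()
        head, dep = (top, second) if action == "REDUCE-LEFT" else (second, top)
        state.arcs.append((head, relation, dep))
        state.stack.append(head)                        # the head stands for the new fragment
    state.actions.append(action)
```

With four tokens, the sequence SHIFT, SHIFT, SHIFT, SWAP, REDUCE-LEFT, SHIFT, SHIFT, REDUCE-LEFT, REDUCE-LEFT attaches token 0 to token 2 and token 1 to token 3, which are crossing (nonprojective) arcs that SHIFT and REDUCE alone cannot produce.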

Along with the discrete transitions above, the parser calculates a
vector representation of the states of B, S, and A; at time step t
these are denoted by $\mathbf{b}_t$, $\mathbf{s}_t$, and
$\mathbf{a}_t$, respectively. The total parser state at t is given by

\[ \mathbf{p}_t = \max\{\mathbf{0},\ \mathbf{W}[\mathbf{s}_t; \mathbf{b}_t; \mathbf{a}_t] + \mathbf{d}\} \tag{1} \]

where the matrix $\mathbf{W}$ and the vector $\mathbf{d}$ are learned
parameters. This continuous-state representation $\mathbf{p}_t$ is
used to decide which operation to apply next, updating B, S, and A
(Figure 1).
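As a small worked illustration of Equation 1, the following sketch computes the parser state from the three summary vectors using plain Python lists. The function name `parser_state` and the toy dimensions are assumptions made for the example; a real implementation would use a tensor library.

```python
from typing import List

Vector = List[float]


def parser_state(W: List[Vector], d: Vector,
                 s_t: Vector, b_t: Vector, a_t: Vector) -> Vector:
    """Equation 1: p_t = max{0, W [s_t; b_t; a_t] + d}, applied component-wise."""
    concat = s_t + b_t + a_t          # list concatenation plays the role of [s_t; b_t; a_t]
    return [max(0.0, sum(w * x for w, x in zip(row, concat)) + bias)
            for row, bias in zip(W, d)]
```

Each row of `W` must have as many entries as the three summaries have in total; negative pre-activations are clipped to zero by the rectifier.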

We elaborate on the design of $\mathbf{b}_t$, $\mathbf{s}_t$, and
$\mathbf{a}_t$ using RNNs in §2.1, on the representation of partial
parses in S in §2.2, and on the parser’s decision mechanism in §2.3. We
discuss the inclusion of SWAP in §2.4.

2.1 Stack LSTMs 

RNNs are functions that read a sequence of vectors incrementally; at
time step t the vector $\mathbf{x}_t$ is read in and the hidden state
$\mathbf{h}_t$ computed using $\mathbf{x}_t$ and the previous hidden
state $\mathbf{h}_{t-1}$. In principle, this allows retaining
information from time steps in the distant past, but the nonlinear
“squashing” functions applied in the calculation of each
$\mathbf{h}_t$ result in a decay of the error signal used in training
with backpropagation. LSTMs are a variant of RNNs designed to cope with
this “vanishing gradient” problem using an extra memory “cell”
(Hochreiter and Schmidhuber, 1997; Graves, 2013).


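Past work explains the computation within an LSTM through the metaphors
of deciding how much of the current input to pass into memory
($\mathbf{i}_t$) or forget ($\mathbf{f}_t$). We refer interested
readers to the original papers and present only the recursive equations
updating the memory cell $\mathbf{c}_t$ and hidden state
$\mathbf{h}_t$ given $\mathbf{x}_t$, the previous hidden state
$\mathbf{h}_{t-1}$, and the memory cell $\mathbf{c}_{t-1}$:

\begin{align*}
\mathbf{i}_t &= \sigma(\mathbf{W}_{ix}\mathbf{x}_t + \mathbf{W}_{ih}\mathbf{h}_{t-1} + \mathbf{W}_{ic}\mathbf{c}_{t-1} + \mathbf{b}_i) \\
\mathbf{f}_t &= \mathbf{1} - \mathbf{i}_t \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_{cx}\mathbf{x}_t + \mathbf{W}_{ch}\mathbf{h}_{t-1} + \mathbf{b}_c) \\
\mathbf{o}_t &= \sigma(\mathbf{W}_{ox}\mathbf{x}_t + \mathbf{W}_{oh}\mathbf{h}_{t-1} + \mathbf{W}_{oc}\mathbf{c}_t + \mathbf{b}_o) \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t),
\end{align*}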

where $\sigma$ is the component-wise logistic sigmoid function and
$\odot$ is the component-wise (Hadamard) product. Parameters are all
represented using $\mathbf{W}$ and $\mathbf{b}$. This formulation
differs slightly from the classic LSTM formulation in that it makes use
of “peephole connections” (Gers et al., 2002) and defines the forget
gate so that it sums with the input gate to 1 (Greff et al., 2015). To
improve the representational capacity of LSTMs (and RNNs generally),
they can be stacked in “layers.” In these architectures, the input to
the LSTM at higher layers at time t is the value of $\mathbf{h}_t$
computed by the lower layer (and $\mathbf{x}_t$ is the input at the
lowest layer).
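The recursion just described (peephole connections, and a forget gate tied to the input gate so that the two sum to 1) can be written out as one update step. This is a minimal sketch for illustration: the parameter dictionary `P`, its key names, and the helper functions are assumptions of the example, and plain Python lists stand in for tensors.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vs: Vector) -> Vector:
    return [sum(xs) for xs in zip(*vs)]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def hadamard(u: Vector, v: Vector) -> Vector:
    return [a * b for a, b in zip(u, v)]


def lstm_step(P: Dict[str, list], x_t: Vector,
              h_prev: Vector, c_prev: Vector) -> Tuple[Vector, Vector]:
    """One LSTM update with peepholes and coupled input/forget gates."""
    # Input gate looks at the input, the previous hidden state, and the previous cell.
    i_t = sigmoid(add(matvec(P["W_ix"], x_t), matvec(P["W_ih"], h_prev),
                      matvec(P["W_ic"], c_prev), P["b_i"]))
    f_t = [1.0 - i for i in i_t]                      # forget gate is tied to the input gate
    cand = [math.tanh(z) for z in add(matvec(P["W_cx"], x_t),
                                      matvec(P["W_ch"], h_prev), P["b_c"])]
    c_t = add(hadamard(f_t, c_prev), hadamard(i_t, cand))
    # Output gate peeks at the freshly updated cell c_t.
    o_t = sigmoid(add(matvec(P["W_ox"], x_t), matvec(P["W_oh"], h_prev),
                      matvec(P["W_oc"], c_t), P["b_o"]))
    h_t = hadamard(o_t, [math.tanh(c) for c in c_t])
    return h_t, c_t
```

Stacking layers then amounts to feeding the `h_t` returned by one layer as the `x_t` of the layer above it at the same time step.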

The stack LSTM augments the left-to-right sequential model of the
conventional LSTM with a stack pointer. As in the LSTM, new inputs are
added in the right-most position, but the stack pointer indicates which
LSTM cell provides $\mathbf{c}_{t-1}$ and $\mathbf{h}_{t-1}$ for the
computation of the next iterate. Further, the stack LSTM provides a pop
operation that moves the stack pointer to the previous element. Hence
each of the parser data structures (B, S, and A) is implemented with
its own stack LSTM, each with its own parameters. The values of
$\mathbf{b}_t$, $\mathbf{s}_t$, and $\mathbf{a}_t$ are the
$\mathbf{h}_t$ vectors from their respective stack LSTMs.
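The push and pop behaviour of a stack LSTM can be sketched as a thin wrapper around any recurrent cell. The class name `StackLSTM` and the `cell` callback signature are assumptions of this example. As a simplification, popped states are discarded rather than kept behind a movable pointer; the summary vector read after each operation is the same either way.

```python
from typing import Callable, List, Tuple

Vector = List[float]
State = Tuple[Vector, Vector]                          # (h, c)
Cell = Callable[[Vector, Vector, Vector], State]       # (x, h_prev, c_prev) -> (h, c)


class StackLSTM:
    """A recurrent summary of a stack that supports push and pop."""

    def __init__(self, cell: Cell, hidden_size: int) -> None:
        self.cell = cell
        # Position 0 holds the initial (empty-stack) state and is never popped.
        zeros = [0.0] * hidden_size
        self.states: List[State] = [(zeros, list(zeros))]

    def push(self, x: Vector) -> None:
        # The state under the stack pointer (the last entry) supplies h_{t-1} and c_{t-1}.
        h_prev, c_prev = self.states[-1]
        self.states.append(self.cell(x, h_prev, c_prev))

    def pop(self) -> None:
        if len(self.states) == 1:
            raise IndexError("pop from an empty stack LSTM")
        self.states.pop()                              # pointer moves back to the previous element

    def summary(self) -> Vector:
        """The h vector at the stack pointer, e.g. s_t for the parser's stack."""
        return self.states[-1][0]
```

Because a push after a pop continues from the state that was current before the popped element was added, the summary always reflects exactly the current stack contents.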

2.2 Composition Functions 

Whenever a REDUCE operation is selected, two tree fragments are popped
off of S and combined to form a new tree fragment, which is then pushed
back onto S (see Figure 1). This tree must be embedded as an input
vector $\mathbf{x}_t$.

To do this, Dyer et al. (2015) use a recursive neural network $g_r$
(for relation r) that composes the representations of the two subtrees
popped from S (we denote these by $\mathbf{u}$ and $\mathbf{v}$),
resulting in a new vector $g_r(\mathbf{u}, \mathbf{v})$ or
$g_r(\mathbf{v}, \mathbf{u})$, depending on the direction of
attachment. The resulting vector embeds the tree fragment in the same
space as the words and other tree fragments. This kind of composition
was thoroughly explored in prior work (Socher et al., 2011; Socher et
al., 2013b; Hermann and Blunsom, 2013; Socher et al., 2013a); for
details, see Dyer et al. (2015).

Figure 1: Parser transitions indicating the action applied to the stack
and buffer and the resulting stack and buffer states. Bold symbols
indicate (learned) embeddings of words and relations, script symbols
indicate the corresponding words and relations. Dyer et al. (2015) used
the SHIFT and REDUCE operations in their continuous-state parser; we
add SWAP.


2.3 Predicting Parser Decisions

The parser uses a probabilistic model of parser decisions at each time
step t. Letting A(S, B) denote the set of allowed transitions given the
stack S and buffer B (i.e., those whose preconditions are met; see
Figure 1), the probability of action $z \in A(S, B)$ is defined using a
log-linear distribution:

\[ p(z \mid \mathbf{p}_t) = \frac{\exp(\mathbf{g}_z^\top \mathbf{p}_t + q_z)}{\sum_{z' \in A(S,B)} \exp(\mathbf{g}_{z'}^\top \mathbf{p}_t + q_{z'})} \tag{2} \]

(where $\mathbf{g}_z$ and $q_z$ are parameters associated with each
action type z).

Parsing proceeds by always choosing the most probable action from
A(S, B). The probabilistic definition allows parameter estimation for
all of the parameters ($\mathbf{W}_*$, $\mathbf{b}_*$ in all three
stack LSTMs, as well as $\mathbf{W}$, $\mathbf{d}$, $\mathbf{g}_*$, and
$q_*$) by maximizing the conditional likelihood of each correct parser
decision given the state.

2.4 Adding the SWAP Operation

Dyer et al. (2015)’s parser implemented the most basic version of the
arc-standard algorithm, which is capable of producing only projective
parse trees. In order to deal with nonprojective trees, we also add the
SWAP operation.

The SWAP operation, first introduced by Nivre (2009), allows a
transition-based parser to produce nonprojective trees. Here, the
inclusion of the SWAP operation requires breaking the linearity of the
stack by removing tokens that are not at the top of the stack. This is
easily handled with the stack LSTM. Figure 1 shows how the parser is
capable of moving words from the stack (S) to the buffer (B), breaking
the linear order of words. Since a node that is swapped may have
already been assigned as the head of a dependent, the buffer (B) can
now also contain tree fragments.

3 Word Representations 

The main contribution of this paper is to change the word
representations. In this section, we present the standard word
embeddings as in Dyer et al. (2015), and the improvements we made by
generating word embeddings designed to capture morphology based on
orthographic strings.

3.1 Baseline: Standard Word Embeddings 

Dyer et al.’s parser generates a word representation for each input
token by concatenating two vectors: a vector representation for each
word type ($\mathbf{w}$) and a representation ($\mathbf{t}$) of the POS
tag of the token (if it is used), provided as auxiliary input to the
parser.² A linear map ($\mathbf{V}$) is applied to the resulting vector
and passed through a component-wise ReLU:

\[ \mathbf{x} = \max\{\mathbf{0},\ \mathbf{V}[\mathbf{w}; \mathbf{t}] + \mathbf{b}\} \]

For out-of-vocabulary words, the parser uses an “UNK” token that is
handled as a separate word during parsing time. This mapping can be
shown schematically as in Figure 2.
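The lookup-based baseline can be illustrated with a short sketch. The function name `embed_token` and the dictionary-based embedding tables are assumptions made for the example; the only behaviour taken from the text is the fallback to a single “UNK” vector and the ReLU over the linear map.

```python
from typing import Dict, List

Vector = List[float]


def embed_token(word: str, tag: str,
                word_vecs: Dict[str, Vector], tag_vecs: Dict[str, Vector],
                V: List[Vector], b: Vector) -> Vector:
    """x = max{0, V [w; t] + b}, where w falls back to the UNK vector for unseen words."""
    w = word_vecs.get(word, word_vecs["UNK"])   # every out-of-vocabulary word shares one vector
    t = tag_vecs[tag]
    concat = w + t                              # list concatenation plays the role of [w; t]
    return [max(0.0, sum(v * x for v, x in zip(row, concat)) + bias)
            for row, bias in zip(V, b)]
```

Under this scheme two different unseen words with the same POS tag receive identical input vectors, which is the limitation the character-based representation in §3.2 is meant to address.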

² Dyer et al. (2015) included a third input representation learned
from a neural language model ($\mathbf{w}_{LM}$). We do not include
these pretrained representations in our experiments, focusing instead
on character-based representations.

Figure 2: Baseline model word embeddings for an in-vocabulary word that
is tagged with POS tag NN (right) and an out-of-vocabulary word with
POS tag JJ (left).

3.2 Character-Based Embeddings of Words 

Following Ling et al. (2015), we compute character-based
continuous-space vector embeddings of words using bidirectional LSTMs
(Graves and Schmidhuber, 2005). When the parser initiates the learning
process and populates the buffer with all the words from the sentence,
it reads the words character by character from left to right and
computes a continuous-space vector embedding of the character sequence,
which is the $\mathbf{h}$ vector of the LSTM; we denote it by
$\overrightarrow{\mathbf{w}}$. The same process is also applied in
reverse (albeit with different parameters), computing a similar
continuous-space vector embedding starting from the last character and
finishing at the first ($\overleftarrow{\mathbf{w}}$); again each
character is represented with an LSTM cell. After that, we concatenate
these vectors and a (learned) representation of their tag to produce
the representation $\mathbf{w}$. As in §3.1, a linear map
($\mathbf{V}$) is applied and passed through a component-wise ReLU.

\[ \mathbf{x} = \max\{\mathbf{0},\ \mathbf{V}[\overrightarrow{\mathbf{w}}; \overleftarrow{\mathbf{w}}; \mathbf{t}] + \mathbf{b}\} \]

This process is shown schematically in Figure 3. 

Note that under this representation, out-of-vocabulary words are
treated as bidirectional LSTM encodings and thus they will be “close”
to other words that the parser has seen during training, ideally close
to their more frequent, syntactically similar morphological relatives.
We conjecture that this will give a clear advantage over a single “UNK”
token for all the words that the parser does not see during training,
as done by Dyer et al. (2015) and other parsers without additional
resources. In §4 we confirm this hypothesis.
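The structure of the character-based encoder (one left-to-right pass, one right-to-left pass with separate parameters, concatenation with the tag vector, then a ReLU over a linear map) can be sketched as follows. To keep the example short, a plain tanh recurrence stands in for the LSTM used in the paper; that substitution, the function names, and the assumption that every character has an embedding are all choices of this illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def run_rnn(chars: str, char_vecs: Dict[str, Vector],
            W_x: Matrix, W_h: Matrix) -> Vector:
    """Read characters in the given order; return the final hidden state.

    Stand-in recurrence h = tanh(W_x x + W_h h); the paper uses an LSTM here.
    """
    h = [0.0] * len(W_h)
    for ch in chars:
        x = char_vecs[ch]                       # assumes every character was seen in training
        h = [math.tanh(sum(a * b for a, b in zip(W_x[k], x)) +
                       sum(a * b for a, b in zip(W_h[k], h)))
             for k in range(len(h))]
    return h


def char_word_vector(word: str, tag_vec: Vector, char_vecs: Dict[str, Vector],
                     fwd: Tuple[Matrix, Matrix], bwd: Tuple[Matrix, Matrix],
                     V: Matrix, b: Vector) -> Vector:
    """x = max{0, V [w_fwd; w_bwd; t] + b} for any word, in or out of vocabulary."""
    w_fwd = run_rnn(word, char_vecs, *fwd)          # first character to last
    w_bwd = run_rnn(word[::-1], char_vecs, *bwd)    # last character to first, own parameters
    concat = w_fwd + w_bwd + tag_vec
    return [max(0.0, sum(v * x for v, x in zip(row, concat)) + bias)
            for row, bias in zip(V, b)]
```

Because the vector is a function of the character sequence, a word never seen in training still receives its own representation, and words that share prefixes or suffixes share most of the computation that produces it.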

4 Experiments 

We applied our parsing model and several variations of it to several
parsing tasks and report results below.

Figure 3: Character-based word embedding of the word party. This
representation is used for both in-vocabulary and out-of-vocabulary
words.

4.1 Data 

In order to find out whether the character-based representations are
capable of learning the morphology of words, we applied the parser to
morphologically rich languages, specifically the treebanks of the SPMRL
shared task (Seddah et al., 2013; Seddah and Tsarfaty, 2014): Arabic
(Maamouri et al., 2004), Basque (Aduriz et al., 2003), French (Abeillé
et al., 2003), German (Seeker and Kuhn, 2012), Hebrew (Sima'an et al.,
2001), Hungarian (Vincze et al., 2010), Korean (Choi, 2013), Polish
(Świdziński and Woliński, 2010) and Swedish (Nivre et al., 2006b). For
all the corpora of the SPMRL Shared Task we used predicted POS tags as
provided by the shared task organizers.³ For these datasets, evaluation
is calculated using eval07.pl, which includes punctuation.

We also experimented with the Turkish dependency treebank⁴ (Oflazer et
al., 2003) of the CoNLL-X Shared Task (Buchholz and Marsi, 2006). We
used gold POS tags, as is common with the CoNLL-X data sets.

³ The POS tags were calculated with the MarMoT tagger (Müller et al.,
2013) by the best performing system of the SPMRL Shared Task
(Björkelund et al., 2013). Arabic: 97.38. Basque: 97.02. French: 97.61.
German: 98.10. Hebrew: 97.09. Hungarian: 98.72. Korean: 94.03. Polish:
98.12. Swedish: 97.27.

⁴ Since the Turkish dependency treebank does not have a development
set, we extracted the last 150 sentences from the 4996 sentences of the
training set as a development set.

To put our results in context with the most recent neural network
transition-based parsers, we run the parser in the same Chinese and
English setups as Chen and Manning (2014) and Dyer et al. (2015). For
Chinese, we use the Penn Chinese Treebank 5.1 (CTB5) following Zhang
and Clark (2008b),⁵ with gold POS tags. For English, we used the
Stanford Dependency (SD) representation of the Penn Treebank⁶ (Marcus
et al., 1993; Marneffe et al., 2006).⁷ Results for Turkish, Chinese,
and English are calculated using the CoNLL-X eval.pl script, which
ignores punctuation symbols.
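Since the two evaluation setups differ in whether punctuation tokens are scored, the following sketch shows how unlabeled and labeled attachment scores can be computed in both modes. It is not a reimplementation of eval07.pl or eval.pl: the function names and the token tuple layout are hypothetical, and treating a token as punctuation when all of its characters fall in a Unicode punctuation category is an assumption of this example.

```python
import unicodedata
from typing import List, Tuple

Token = Tuple[str, int, str]      # (word form, head index, relation label)


def is_punctuation(form: str) -> bool:
    """Assumed criterion: every character is in a Unicode punctuation category."""
    return bool(form) and all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold: List[Token], pred: List[Token],
                      include_punct: bool) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent over the scored tokens."""
    total = uas = las = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if not include_punct and is_punctuation(form):
            continue                              # token is left out of the evaluation
        total += 1
        if g_head == p_head:
            uas += 1
            if g_rel == p_rel:                    # LAS needs both head and label to match
                las += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return 100.0 * uas / total, 100.0 * las / total
```

A misattached full stop lowers both scores when punctuation is included and has no effect when it is ignored, which is one reason scores from the two setups are not directly comparable.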

4.2 Experimental Configurations 

In order to isolate the improvements provided by the LSTM encodings of
characters, we run the stack LSTM parser in the following
configurations:

• Words: words only, as in §3.1 (but without POS tags)

• Chars: character-based representations of words with bidirectional
LSTMs, as in §3.2 (but without POS tags)

• Words + POS: words and POS tags (§3.1)

• Chars + POS: character-based representations of words with
bidirectional LSTMs plus POS tags (§3.2)

None of the experimental configurations include pretrained word
embeddings or any additional data resources. All experiments include
the SWAP transition, meaning that nonprojective trees can be produced
in any language.

Dimensionality. The full version of our parsing model sets
dimensionalities as follows. LSTM hidden states are of size 100, and we
use two layers of LSTMs for each stack. Embeddings of the parser
actions used in the composition functions have 20 dimensions, and the
output embedding size is 20 dimensions. The learned word embeddings
have 32 dimensions when used, while the character-based representations
have 100 dimensions when used. Part-of-speech embeddings have 12
dimensions. These dimensionalities were chosen after running several
tests with different values, but a more careful selection of these
values would probably further improve results.

⁵ Training: 001-815, 1001-1136. Development: 886-931, 1148-1151. Test:
816-885, 1137-1147.

⁶ Training: 02-21. Development: 22. Test: 23.

⁷ The POS tags are predicted by using the Stanford Tagger (Toutanova et
al., 2003) with an accuracy of 97.3%.


4.3 Training Procedure 

Parameters are initialized randomly (refer to Dyer et al. (2015) for
specifics) and optimized using stochastic gradient descent (without
minibatches) using derivatives of the negative log likelihood of the
sequence of parsing actions computed using backpropagation. Training is
stopped when the learned model’s UAS stops improving on the development
set, and this model is used to parse the test set. No pretraining of
any parameters is done.
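The training objective can be made concrete with a sketch that evaluates the negative log likelihood of a gold action sequence under the log-linear distribution of §2.3 (Equation 2), normalizing only over the actions allowed at each step. The function names and the `(p_t, allowed, gold)` step format are assumptions of this example, and it computes the objective only; in practice the gradients would come from backpropagation through the whole network.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def action_log_probs(p_t: Vector, allowed: List[str],
                     g: Dict[str, Vector], q: Dict[str, float]) -> Dict[str, float]:
    """log p(z | p_t) for each allowed action z, following Equation 2."""
    scores = {z: sum(a * b for a, b in zip(g[z], p_t)) + q[z] for z in allowed}
    m = max(scores.values())                      # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {z: s - log_norm for z, s in scores.items()}


def sequence_nll(steps: Sequence[Tuple[Vector, List[str], str]],
                 g: Dict[str, Vector], q: Dict[str, float]) -> float:
    """Negative log likelihood of the gold actions; each step is (p_t, allowed, gold)."""
    return -sum(action_log_probs(p_t, allowed, g, q)[gold]
                for p_t, allowed, gold in steps)
```

Greedy parsing uses the same scores: at each step the parser applies the allowed action with the highest log probability.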

4.4 Results and Discussion 

Tables 1 and 2 show the results of the parsers for the development sets
and the final test sets, respectively. Most notable are improvements
for agglutinative languages (Basque, Hungarian, Korean, and Turkish),
both when POS tags are included and when they are not. Consistently,
across all languages, Chars outperforms Words, suggesting that the
character-level LSTMs are learning representations that capture similar
information to parts of speech. On average, Chars is on par with
Words + POS, and the best average of labeled attachment scores is
achieved with Chars + POS.

It is common practice to encode morphological information in treebank
POS tags; for instance, the Penn Treebank includes English number and
tense (e.g., NNS is plural noun and VBD is verb in past tense). Even if
our character-based representations are capable of encoding the same
kind of information, existing POS tags suffice for high accuracy.
However, the POS tags in treebanks for morphologically rich languages
do not seem to be enough.

Swedish, English, and French use suffixes for the verb tenses and
number,⁸ while Hebrew uses prepositional particles rather than
grammatical case. Tsarfaty (2006) and Cohen and Smith (2007) argued
that, for Hebrew, determining the correct morphological segmentation is
dependent on syntactic context. Our approach sidesteps this step,
capturing the same kind of information in the vectors, and learning it
from syntactic context. Even for Chinese, which is not morphologically
rich, Chars shows a benefit over Words, perhaps by capturing
regularities in syllable structure within words.


⁸ Tense and number features provide little improvement in a
transition-based parser, compared with other features such as case,
when the POS tags are included (Ballesteros, 2013).



UAS

Language    Words   Chars   Words + POS   Chars + POS
Arabic      86.14   87.20   87.44         87.07
Basque      78.42   84.97   83.49         85.58
French      84.84   86.21   87.00         86.33
German      88.14   90.94   91.16         91.23
Hebrew      79.73   79.92   81.99         80.76
Hungarian   72.38   80.16   78.47         80.85
Korean      78.98   88.98   87.36         89.14
Polish      73.29   85.69   89.32         88.54
Swedish     73.44   75.03   80.02         78.85
Turkish     71.10   74.91   77.13         77.96
Chinese     79.43   80.36   85.98         85.81
English     91.64   91.98   92.94         92.49
Average     79.79   83.86   85.19         85.38

LAS

Language    Words   Chars   Words + POS   Chars + POS
Arabic      82.73   84.34   84.81         84.36
Basque      67.08   78.22   74.31         79.52
French      80.32   81.70   82.71         81.51
German      85.36   88.68   89.04         88.83
Hebrew      69.42   70.58   74.11         72.18
Hungarian   62.14   75.61   69.50         76.16
Korean      67.48   86.80   83.80         86.88
Polish      65.13   78.23   81.84         80.97
Swedish     64.77   66.74   72.09         69.88
Turkish     53.98   62.91   62.30         62.87
Chinese     75.64   77.06   84.36         84.10
English     88.60   89.58   90.63         90.08
Average     71.89   78.37   79.13         79.78

Table 1: Unlabeled attachment scores (first table) and labeled
attachment scores (second table) on the development sets (not a
standard development set for Turkish). In each table, the first two
columns show the results of the parser with word lookup (Words) vs.
character-based (Chars) representations. The last two columns add POS
tags. Boldface shows the better result comparing Words vs. Chars and
comparing Words + POS vs. Chars + POS.


UAS

Language    Words   Chars   Words + POS   Chars + POS
Arabic      85.21   86.08   86.05         86.07
Basque      77.06   85.19   82.92         85.22
French      83.74   85.34   86.15         85.78
German      82.75   86.80   87.33         87.26
Hebrew      77.62   79.93   80.68         80.17
Hungarian   72.78   80.35   78.64         80.92
Korean      78.70   88.39   86.85         88.30
Polish      72.01   83.44   87.06         85.97
Swedish     76.39   79.18   83.43         83.24
Turkish     71.70   76.32   75.32         76.34
Chinese     79.01   79.94   85.96         85.30
English     91.16   91.47   92.57         91.63
Average     79.01   85.36   84.41         84.68

LAS

Language    Words   Chars   Words + POS   Chars + POS
Arabic      82.05   83.41   83.46         83.40
Basque      66.61   79.09   73.56         78.61
French      79.22   80.92   82.03         81.08
German      79.15   84.04   84.62         84.49
Hebrew      68.71   71.26   72.70         72.26
Hungarian   61.93   75.19   69.31         76.34
Korean      67.50   86.27   83.37         86.21
Polish      63.96   76.84   79.83         78.24
Swedish     67.69   71.19   76.40         74.47
Turkish     54.55   64.34   61.22         62.28
Chinese     74.79   76.29   84.40         83.72
English     88.42   88.94   90.31         89.44
Average     71.22   78.15   78.43         79.21

Table 2: Unlabeled attachment scores (first table) and labeled
attachment scores (second table) on the test sets. In each table, the
first two columns show the results of the parser with word lookup
(Words) vs. character-based (Chars) representations. The last two
columns add POS tags. Boldface shows the better result comparing Words
vs. Chars and comparing Words + POS vs. Chars + POS.


4.4.1 Learned Word Representations 

Figure 4 visualizes a sample of the representations learned by the
character-based bidirectional LSTMs (Chars). Clear clusters of past
tense verbs, gerunds, and other syntactic classes are visible. The
colors in the figure represent the most common POS tag for each word.

4.4.2 Out-of-Vocabulary Words 

The character-based representation for words is notably beneficial for
out-of-vocabulary (OOV) words. We tested this specifically by comparing
Chars to a model in which all OOVs are replaced by the string “UNK”
during parsing. This always has a negative effect on LAS (average -4.5
points; -2.8 UAS). Figure 5 shows how this drop varies with the
development OOV rate across treebanks; most extreme is Korean, which
drops 15.5 LAS. A similar but less pronounced pattern was observed for
models that include POS.

Interestingly, this artificially impoverished model is still
consistently better than Words for all languages (e.g., for Korean, by
4 LAS). This implies that not all of the improvement is due to OOV
words; statistical sharing across orthographically close words is
beneficial, as well.
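The preprocessing behind this ablation is simple enough to sketch: measure how many development tokens were never seen in training, and map those tokens to a single placeholder before parsing. The function names below are hypothetical, and the vocabulary is assumed to be the set of word forms in the training sentences.

```python
from typing import Iterable, List, Set


def build_vocab(training_sentences: Iterable[List[str]]) -> Set[str]:
    """All word forms observed in the training data."""
    return {w for sent in training_sentences for w in sent}


def oov_rate(sentences: Iterable[List[str]], vocab: Set[str]) -> float:
    """Fraction of tokens whose form is not in the training vocabulary."""
    tokens = [w for sent in sentences for w in sent]
    if not tokens:
        return 0.0
    return sum(w not in vocab for w in tokens) / len(tokens)


def replace_oov(sentence: List[str], vocab: Set[str], unk: str = "UNK") -> List[str]:
    """Impoverished input: every unseen word form becomes the same placeholder string."""
    return [w if w in vocab else unk for w in sentence]
```

Feeding the output of `replace_oov` to a character-based model removes its ability to read unseen words while leaving the representations of seen words untouched, which is what allows the two sources of improvement to be separated.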

Figure 4: Character-based word representations of 30 random words from
the English development set (Chars). Dots in red represent past tense
verbs; dots in orange represent gerund verbs; dots in black represent
present tense verbs; dots in blue represent adjectives; dots in green
represent adverbs; dots in yellow represent singular nouns; dots in
brown represent plural nouns. The visualization was produced using
t-SNE; see http://lvdmaaten.github.io/tsne/.

Figure 5: On the x-axis is the OOV rate in development data, by
treebank; on the y-axis is the difference in development-set LAS
between the Chars model as described in §3.2 and one in which all OOV
words are given a single representation.

4.4.3 Computational Requirements

The character-based representations make the parser slower, since they
require composing the character-based bidirectional LSTMs for each word
of the input sentence; however, at test time these results could be
cached. On average, Words parses a sentence in 44 ms, while Chars needs
130 ms.⁹ Training time is affected by the same constant, needing some
hours to have a competitive model. In terms of memory, Words requires
on average 300 MB of main memory for both training and parsing, while
Chars requires 450 MB.

⁹ We are using a machine with 32 Intel Xeon CPU E5-2650 at 2.00GHz; the
parser runs on a single core.
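The caching suggested for test time can be sketched with memoization from the standard library. The wrapper name `cached_encoder` and the `encode` callback are hypothetical; the sketch assumes the model parameters are frozen (so a word form always maps to the same vector) and that the encoder returns an immutable tuple.

```python
from functools import lru_cache
from typing import Callable, Optional, Tuple


def cached_encoder(encode: Callable[[str], Tuple[float, ...]],
                   maxsize: Optional[int] = None) -> Callable[[str], Tuple[float, ...]]:
    """Wrap a word-form -> vector function so each distinct form is encoded only once.

    Only valid while the parameters are fixed (i.e. at test time); during training
    the encodings change after every update and must be recomputed.
    """
    return lru_cache(maxsize=maxsize)(encode)
```

Only the character-level part of the representation depends on the word form alone, so it is that part (the forward and backward encodings), rather than the final vector that also includes the POS tag, that would be cached.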

4.4.4 Comparison with State-of-the-Art 

Table 3 shows a comparison with state-of-the-art parsers. We include
greedy transition-based parsers that, like ours, do not apply a beam
search (Zhang and Clark, 2008b) or a dynamic oracle (Goldberg and
Nivre, 2013). For all the SPMRL languages we show the results of
Ballesteros (2013), who reported results after carrying out a careful
automatic morphological feature selection experiment. For Turkish, we
show the results of Nivre et al. (2006a), who also carried out a
careful manual morphological feature selection. Our parser outperforms
these in most cases. Since those systems rely on morphological
features, we believe that this comparison gives further evidence that
the character-based representations are capturing morphological
information, though without explicit morphological features. For
English and Chinese, we report (Dyer et al., 2015) which is Words + POS
but with pretrained word embeddings.

We also show the best reported results on these datasets. For the SPMRL
data sets, the best performing system of the shared task is either
Björkelund et al. (2013) or Björkelund et al. (2014), which are
consistently better than our system for all languages. Note that the
comparison is harsh to our system, which does not use unlabeled data or
explicit morphological features nor any combination of different
parsers. For Turkish, we report the results of Koo et al. (2010), which
only reported unlabeled attachment scores. For English, we report
(Weiss et al., 2015) and for Chinese, we report (Dyer et al., 2015)
which is Words + POS but with pretrained word embeddings.

            This Work                    Best Greedy Result      Best Published Result
Language    UAS    LAS    System         UAS    LAS    System    UAS    LAS    System
Arabic      86.08  83.41  Chars          84.57  81.90  B'13      88.32  86.21  B+'13
Basque      85.22  78.61  Chars + POS    84.33  78.58  B'13      89.96  85.70  B+'14
French      86.15  82.03  Words + POS    83.35  77.98  B'13      89.02  85.66  B+'14
German      87.33  84.62  Words + POS    85.38  82.75  B'13      91.64  89.65  B+'13
Hebrew      80.68  72.70  Words + POS    79.89  73.01  B'13      87.41  81.65  B+'14
Hungarian   80.92  76.34  Chars + POS    83.71  79.63  B'13      89.81  86.13  B+'13
Korean      88.39  86.27  Chars          85.72  82.06  B'13      89.10  87.27  B+'14
Polish      87.06  79.83  Words + POS    85.80  79.89  B'13      91.75  87.07  B+'13
Swedish     83.43  76.40  Words + POS    83.20  75.82  B'13      88.48  82.75  B+'14
Turkish     76.32  64.34  Chars          75.82  65.68  N+'06a    77.55  n/a    K+'10
Chinese     85.96  84.40  Words + POS    87.20  85.70  D+'15     87.20  85.70  D+'15
English     92.57  90.31  Words + POS    93.10  90.90  D+'15     94.08  92.19  W+'15

Table 3: Test-set performance of our best results (according to UAS or
LAS, whichever has the larger difference), compared to state-of-the-art
greedy transition-based parsers (“Best Greedy Result”) and best results
reported (“Best Published Result”). All of the systems we compare
against use explicit morphological features and/or one of the
following: pretrained word embeddings, unlabeled data and a combination
of parsers; our models do not. B'13 is Ballesteros (2013); N+'06a is
Nivre et al. (2006a); D+'15 is Dyer et al. (2015); B+'13 is Björkelund
et al. (2013); B+'14 is Björkelund et al. (2014); K+'10 is Koo et al.
(2010); W+'15 is Weiss et al. (2015).

5 Related Work 

Character-based representations have been explored in other NLP tasks;
for instance, dos Santos and Zadrozny (2014) and dos Santos and
Guimarães (2015) learned character-level neural representations for POS
tagging and named entity recognition, getting a large error reduction
in both tasks. Our approach is similar to theirs. Others have used
character-based models as features to improve existing models. For
instance, Chrupała (2014) used character-based recurrent neural
networks to normalize tweets.

Botha and Blunsom (2014) show that stems, prefixes and suffixes can be
used to learn useful word representations, but their approach relies on
an external morphological analyzer. That is, they learn the
morpheme-meaning relationship with an additive model, whereas we do not
need a morphological analyzer. Similarly, Chen et al. (2015) proposed
joint learning of character and word embeddings for Chinese, claiming
that characters contain rich information.


Methods for joint morphological disambiguation and parsing have been
widely explored (Tsarfaty, 2006; Cohen and Smith, 2007; Goldberg and
Tsarfaty, 2008; Goldberg and Elhadad, 2011). More recently, Bohnet et
al. (2013) presented an arc-standard transition-based parser that
performs competitively for joint morphological tagging and dependency
parsing for richly inflected languages, such as Czech, Finnish, German,
Hungarian, and Russian. Our model seeks to achieve a similar benefit to
parsing without explicitly reasoning about the internal structure of
words.

Zhang et al. (2013) presented efforts on Chinese parsing with
characters, showing that Chinese can be parsed at the character level,
and that Chinese word segmentation is useful for predicting the correct
POS tags (Zhang and Clark, 2008a).

To the best of our knowledge, previous work has 
not used character-based embeddings to improve 
dependency parsers, as done in this paper. 

6 Conclusion 

We have presented several interesting findings. First, we add new
evidence that character-based representations are useful for NLP tasks.
In this paper, we demonstrate that they are useful for transition-based
dependency parsing, since they are capable of capturing morphological
information crucial for analyzing syntax.

The improvements provided by the character-based representations using
bidirectional LSTMs are strong for agglutinative languages, such as
Basque, Hungarian, Korean, and Turkish, comparing favorably to POS tags
as encoded in those languages’ currently available treebanks. This
outcome is important, since annotating morphological information for a
treebank is expensive. Our finding suggests that the best investment of
annotation effort may be in dependencies, leaving morphological
features to be learned implicitly from strings.

The character-based representations are also a way of overcoming the
out-of-vocabulary problem; without any additional resources, they
enable the parser to substantially improve the performance when OOV
rates are high. We expect that, in conjunction with a pretraining
regime, or in conjunction with distributional word embeddings, further
improvements could be realized.